Chapter 1
IN THIS CHAPTER
Getting up to speed on the prerequisites for biostatistics
Understanding the human research environment
Surveying the specific procedures used to analyze biological data
Estimating how many participants you need
Working with distributions
Biostatistics deals with the design and execution of scientific studies involving biology, the acquisition and analysis of data from those studies, and the interpretation and presentation of the results of those analyses. This book is meant to be a useful and easy-to-understand companion to the more formal textbooks used in graduate-level biostatistics courses. Because most of these courses teach how to analyze data from epidemiologic studies and clinical trials, this book focuses on that as well. In this first chapter, we introduce you to the fundamentals of biostatistics.
Chapters 2 and 3 are designed to bring you up to speed on the basic math and statistical background that’s needed to understand biostatistics and give you supplementary information or context that you may find useful while reading the rest of this book.
For instructional purposes, some chapters in this book include step-by-step instructions for performing statistical tests and analyses by hand. We include such instruction only to illustrate the concepts that are involved in the procedure or to demonstrate calculations that are simple to do manually.
However, we demonstrate many of the statistical functions we talk about in this book using R, which is a free, open-source software package. If you are taking a class that assigns a particular software package, you will have to use that software for the course, and it may be commercial software that carries a fee. If you are learning on your own, you may choose to use free, open-source software instead. Chapter 4 provides guidance on both commercial and free software.
Three chapters discuss clinical trials:
Much of the work in biostatistics is using data from samples to make inferences about the background population from which the sample was drawn. Now that we have large databases, it is possible to easily take samples of data. Chapter 6 provides guidance on different ways to take samples of larger populations so you can make valid population-based estimates from these samples. Sampling is especially important when doing observational studies. Clinical trials are experiments in which participants are assigned interventions. In observational studies, by contrast, participants are merely observed, and the data collected from them are analyzed statistically to make inferences. Chapter 7 describes these observational study designs and the statistical issues that need to be considered when analyzing data arising from such studies.
Data used in biostatistics are often collected in online databases, but some data are still collected on paper. Regardless of their source, the data must be put into electronic format and arranged in a certain way before statistical software can analyze them. Chapter 8 is devoted to describing how to get your data into the computer and arrange them properly so they can be analyzed correctly. It also describes how to collect and validate your data. Then in Chapter 9, we show you how to summarize each type of data and display it graphically. We explain how to make bar charts, box-and-whiskers charts, and more.
Most statistical analysis involves inferring, or drawing conclusions about the population at large based on your observations of a sample drawn from that population. The theory of statistical inference is often divided into two broad sub-theories: estimation theory and decision theory.
Chapter 10 deals with statistical estimation theory, which addresses the question of how accurately and precisely you can estimate a population parameter from the values you observe in your sample. For example, you may want to estimate the mean blood hemoglobin concentration in adults with Type II diabetes, or the true correlation coefficient between body weight and height in certain pediatric populations. Chapter 10 describes how to estimate these parameters by constructing a confidence interval around your estimate. The confidence interval is the range that is likely to include the true population parameter, which provides an idea of the precision of your estimate.
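To make the idea of a confidence interval concrete, here is a minimal sketch written in Python rather than the R used elsewhere in this book. It assumes a normal approximation (a z multiplier of about 1.96 for 95 percent confidence), which is reasonable only for larger samples; with a sample this small you would normally use the Student t distribution instead. The function name and the hemoglobin values are invented purely for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, confidence=0.95):
    """Approximate confidence interval for a mean (normal approximation)."""
    n = len(values)
    m = mean(values)
    se = stdev(values) / sqrt(n)  # standard error of the mean
    # Two-sided z multiplier, for example about 1.96 when confidence is 0.95
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m - z * se, m + z * se


# Hypothetical hemoglobin concentrations (g/dL), invented for illustration
hemoglobin = [13.1, 14.2, 12.8, 13.9, 14.5, 13.3, 12.9, 14.0, 13.6, 13.7]

low, high = mean_confidence_interval(hemoglobin)
print(f"mean = {mean(hemoglobin):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

A narrower interval means a more precise estimate. Collecting more observations shrinks the standard error, which in turn tightens the interval.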
Much of the rest of this book deals with statistical decision theory, which is how to decide whether some effect you’ve observed in your data reflects a real difference or association in the background population or is merely the result of random fluctuations in your data or sampling. If you measure the mean blood hemoglobin concentration in two different samples of adults with Type II diabetes, you will likely get two different numbers. But does this difference reflect a real difference between the groups in terms of blood hemoglobin concentration? Or is this difference a result of random fluctuations? Statistical decision theory helps you decide.
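One way to see how such a decision can be made is a permutation test, sketched below in Python (not the R used elsewhere in this book). It is only one of several possible approaches and is not necessarily the method later chapters use. The idea is that if group membership truly makes no difference, shuffling the group labels should produce differences in means as large as the observed one fairly often. The function name, the seed, and the data are assumptions made for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Share of shuffles that look at least as extreme as the real data
    return extreme / n_permutations


# Hypothetical hemoglobin values (g/dL) for two samples, invented here
sample_1 = [13.1, 14.2, 12.8, 13.9, 14.5, 13.3]
sample_2 = [13.4, 13.0, 14.1, 13.8, 12.9, 13.6]

p_value = permutation_p_value(sample_1, sample_2)
print(f"approximate p value = {p_value:.3f}")
```

A small p value suggests the observed difference would be unusual if only random fluctuation were at work. A large one means chance alone is a plausible explanation.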
In Part 4, we cover statistical decision theory in terms of comparing means and proportions between groups, as well as understanding the relationship between two or more variables.
In Part 4, we show you different ways to compare groups statistically.
Epidemiology and biostatistics are interested in causal inference, which means trying to figure out what causes particular outcomes in biological research. While it is possible to look at the relationship between two variables in a bivariate analysis, regression analysis is the part of statistics that enables you to explore the relationship between multiple variables and one outcome in the same model so you can evaluate how much each variable contributes to the outcome. Here are some use-cases for regression:
Regression analysis can manage all these tasks and many more. Regression is so important in biological research that all the chapters in Part 5 are focused on some aspect of regression.
But in real-world biological and epidemiologic research, you encounter more complicated relationships. Chapter 18 describes logistic regression, where the outcome is the occurrence or non-occurrence of an event (such as being diagnosed with Type II diabetes), and you want to predict the probability that the event will occur. You also find out about several other kinds of regression in Chapter 19:
Finally, Part 5 ends with Chapter 20, which provides guidance on the mechanics of regression modeling, including how to develop a modeling plan, and how to choose variables to include in models.
Sooner or later, everyone dies, and in biological research, it becomes especially important to characterize that sooner-or-later part as accurately as possible using survival analysis techniques. But characterizing survival can get tricky. It’s possible to say that patients may live an average of 5.3 years after they are diagnosed with a particular disease. But what is the exact survival experience? Imagine you do a study with patients who have this disease. You may ask: Do all patients tend to live around five or six years, or do half the patients die within the first few months, and the other half survive ten years or more? And what if some patients live longer than the observational period of your study? How do you include them in your analysis? And what about participants who stopped returning calls from your study staff? You do not know if these dropouts went on to live or die. How do you include their data in your analysis?
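One widely used answer to these questions is the Kaplan-Meier estimator, which keeps participants in the analysis for as long as they were actually observed and then removes them from the at-risk group when they are censored (lost to follow-up or still alive when the study ends). The Python sketch below is a bare-bones illustration, not the R code you would use in practice. The function name and the follow-up data are invented for this example.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimates.

    times  -- follow-up time for each participant
    events -- True if death was observed, False if censored
    Returns a list of (time, survival_probability) pairs, one for each
    time at which at least one death occurred.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += int(data[i][1])
            leaving += 1
            i += 1
        if deaths:
            # Multiply by the fraction of at-risk participants who survive t
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored participants both leave the at-risk group
        at_risk -= leaving
    return curve


# Hypothetical follow-up in years; False marks a censored participant
years = [1, 2, 3, 4, 5]
died = [True, False, True, False, True]

for t, s in kaplan_meier(years, died):
    print(f"year {t}: estimated survival = {s:.3f}")
```

Censored participants never count as deaths, but they do count toward the number at risk up to the moment they leave, so their partial information is not thrown away.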
Statistics books always contain tables, so why should this one be any different? Back in the not-so-good old days, when analysts had to do statistical calculations by hand, they needed to use tables of the common statistical distributions to complete the calculation of the significance test. They needed tables for the normal distribution, Student t, chi-square, Fisher F, and others. Now, software does all this for you, including calculating exact p values, so these printed tables aren’t necessary anymore.
But you should still be familiar with the common statistical distributions that may describe the fluctuations in your data, or that may be referenced in the course of performing a statistical calculation. Chapter 24 contains a list of commonly used distribution functions, with explanations of where you can expect to encounter those distributions and what they look like. We also include a description of some of their properties and how they’re related to other distributions. Some of them are accompanied by a small table of critical values, corresponding to statistical significance at α = 0.05.
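As a small illustration of how software replaces a printed table, the Python sketch below looks up critical values of the standard normal distribution at α = 0.05. It uses only Python's standard library, which covers the normal distribution but not Student t, chi-square, or Fisher F. For those you would turn to a statistics package such as R. The variable names are chosen just for this example.

```python
from statistics import NormalDist

alpha = 0.05
standard_normal = NormalDist(mu=0, sigma=1)

# Two-sided test: alpha is split evenly between the two tails
z_two_sided = standard_normal.inv_cdf(1 - alpha / 2)

# One-sided test: all of alpha sits in a single tail
z_one_sided = standard_normal.inv_cdf(1 - alpha)

print(f"two-sided critical z = {z_two_sided:.3f}")
print(f"one-sided critical z = {z_one_sided:.3f}")
```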
Of all the statistical challenges a researcher may encounter, none seems to instill as much apprehension and insecurity as having to estimate the number of participants needed for a study. While smaller sample sizes mean less data collection work, you want to make sure your target sample size is large enough so that in the end, your study has sufficient power. You want to conduct a study with a high probability of yielding a statistically significant result if the hypothesized effect is truly present in the population.
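To give a feel for what such a calculation involves, the Python sketch below applies the standard normal-approximation formula for comparing the means of two equal-sized groups: n per group = 2 × ((z for α/2 + z for power) × SD / difference)². This is a simplified starting point, not the full procedure. Exact calculations based on the t distribution give slightly larger numbers, and the function name and inputs are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist


def participants_per_group(effect_size, sd, alpha=0.05, power=0.80):
    """Approximate participants needed per group to compare two means.

    effect_size -- smallest difference in means you want to detect
    sd          -- assumed common standard deviation in each group
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * sd / effect_size) ** 2
    return ceil(n)  # round up, because you cannot enroll part of a person


# Hypothetical planning values: detect a 0.5-unit difference when SD is 1.0
n_needed = participants_per_group(effect_size=0.5, sd=1.0)
print(f"about {n_needed} participants per group")
```

Notice the trade-off built into the formula: halving the difference you want to detect roughly quadruples the number of participants you need.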